Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

imp: exec: send SIGKILL to cgroup/children on SIGUSR1 #186

Merged
merged 11 commits into from
Nov 1, 2024

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Oct 30, 2024

This PR updates the IMP to translate SIGUSR1 to SIGKILL for flux-imp exec as defined in RFC 15.

Additionally, if possible, the IMP will signal all members of its cgroup (besides itself) when delivering SIGKILL.

Finally, to better handle returning the proper wait status to its parent, when the child of the IMP is terminated by a signal, the IMP raises that same signal to itself.

This is WIP pending real system testing.

@grondo
Copy link
Contributor Author

grondo commented Oct 30, 2024

Hm, readthedocs build seems to have been broken at some point as well. Will have to look into that.

@grondo grondo changed the title WIP: imp: exec: send SIGKILL to cgroup/children on SIGUSR1 imp: exec: send SIGKILL to cgroup/children on SIGUSR1 Oct 31, 2024
garlick
garlick previously approved these changes Oct 31, 2024
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

I have some comments but all superficial and/or optional.

src/imp/cgroup.c Outdated Show resolved Hide resolved
src/imp/cgroup.c Outdated Show resolved Hide resolved
src/imp/cgroup.c Show resolved Hide resolved
src/imp/cgroup.c Outdated Show resolved Hide resolved
src/imp/cgroup.c Show resolved Hide resolved
src/imp/signals.c Show resolved Hide resolved
src/test/docker/docker-run-checks.sh Show resolved Hide resolved
@grondo
Copy link
Contributor Author

grondo commented Oct 31, 2024

As part of this work I switched the cgroup controller that the IMP checks for from name=systemd to pids, since the latter can actually be used in testing. I had thought this was working, but after further investigation, that isn't quite right and the pids controller is unchanged for processes in a job managed by sdexec.

I think the correct thing (after perusing cgroups(7)) is to use the 0:: line in /proc/self/cgroup, which I think corresponds to the cgroups v2 hierarchy.

I'll have to modify this PR to make that change.

Problem: Signal forwarding is currently implemented directly in the
IMP exec subcommand which makes it impossible to reuse the code in
other IMP subcommands.

Move signal handling to imp/signals.[ch] for better code reuse.

Pass the imp state object into the signal handler for possible future
use, otherwise behavior of the IMP exec command should be identical to
before this change.
@grondo grondo force-pushed the issue#185 branch 2 times, most recently from 22462f7 to 4668b2c Compare November 1, 2024 00:11
Problem: To comply with RFC 15, the IMP needs to know attributes of
its current cgroup, but this information is not currently available.

Add a cgroup_info structure to the imp_state object, and populate it
at IMP startup.

This moves and modifies some internal code from pidinfo.c. In the
future, this duplicate code in pidinfo.c will be removed.
@grondo
Copy link
Contributor Author

grondo commented Nov 1, 2024

Ok, I think I've fixed the cgroup discovery here. The code is somewhat simplified, at startup the IMP:

  • checks if either /sys/fs/cgroup or /sys/fs/cgroup/unified are mounted as cgroup2
  • if not then check for a legacy (cgroup v1) mount at /sys/fs/cgroup/systemd
  • grab the current relative cgroup path for either case from /proc/self/cgroup (logic to find the correct one is now fixed)

I also had to add a workaround to remove leading /.. from the relative cgroup path, which seems to pop up when run within a container (the relative path is relative to the container mount point on the host I think)

Problem: RFC 15 specifies that the IMP should handle SIGUSR1 as a
surrogate for SIGKILL, but this is not currently supported.

Add cgroup_kill() which will send a signal to all member's of the
IMP's cgroup (besides the IMP's own pid).

Add SIGUSR1 to the set of signals handled by the signal forwarding
implementation of the IMP. In the case of SIGUSR1, deliver SIGKILL
to all members of the current cgroup if use_cgroup_kill is true,
otherwise fallback to pid_kill_children().
Problem: The lingering IMP exits with 127+signo if the invoked child
was terminated by a signal during `flux-imp exec`, but ideally the
IMP would exit with the exact wait status of the child.

Add a convenience function, `imp_raise()`, which raises a signal
after setting the handler to the default disposition for the signal in
question.  Use imp_raise() to deliver the same signal which terminated
the child in `flux-imp exec` if the child is detected to have exited
due to a signal.
Problem: Some tests in the flux-security testsuite may need write
access to cgroups, but docker-run-checks.sh explicitly mounts cgroups
as read-only.

Mount cgroups read-write for testing purposes.
Problem: The IMP exec tests that use sign-none.toml and sudo are
unnecessarily complicated because the IMP has to be run in a subshell,
sometimes in the background. This not only duplicates a lot of code,
but makes it impossible to get the exit code of the IMP when it is
run in the background.

Add a function to run `flux-imp exec` under sudo with the
sign-none.toml config. While we're at it, create the `sleeper.sh`
helper at the beginning of the tests, instead of as part of another
test.

Update callers to use `imp_exec_sign_none()` where possible.
Problem: There are no tests the ensure sending SIGUSR1 to the IMP
causes it to send SIGKILL to its children.

Add a couple tests that ensure SIGKILL is sent when the IMP is running
both in and out of an "imp-shell" cgroup.
Problem: pstree(1) is not available on all test platforms.

Remove its use in debugging tests.
Problem: The readthedocs build fails with

 sphinx.errors.VersionRequirementError: 5.0

Update doc/requirements.txt to match those in flux-core, including
an updated sphinx version requirement.
Problem: Some distros don't seem to have CGROUP2_SUPER_MAGIC defined
in linux/magic.h.

Copy the magic value from a system that has it. Remove the ifndef
CGROUP2_SUPER_MAGIC block since it will now always be defined.
Problem: Relative paths for cgroups in /proc/self/cgroup may contain
leading `/..` when the IMP is run inside some containers. This is
because the path is relative to the container's mount point, not the
actual mount point of the cgroupfs.

Just strip the leading /.. from paths during discovery of the current
cgroup path from `/proc/self/cgroup`. This should only apply in
containers and thus CI.
@garlick
Copy link
Member

garlick commented Nov 1, 2024

I grabbed your snapshot from last night and put it on my test cluster. This morning I made some minimal changes to job-exec on my issue#6406 branch to send SIGUSR1 instead of imp-kill when shell termination methods fail. Then I ran a test on my system by starting a long running job, sending the flux-shell a SIGSTOP, then canceling the job. Looks good! Flux on rank 0 logged:

[  +5.002092] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +10.005491] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +15.005766] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +20.005115] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +25.005382] job-exec[0]: Sending SIGKILL to job ƒ3L9MFPYEKZ
[ +25.005415] job-exec[0]: Sending SIGUSR1 to IMP for job ƒ3L9MFPYEKZ
[ +25.169307] job-manager[0]: housekeeping: ƒ3L9MFPYEKZ started
[ +25.201629] job-list[0]: purged 1 inactive jobs
[ +25.203373] job-manager[0]: purged 1 inactive jobs
[Nov01 08:43] job-manager[0]: housekeeping: ƒ3L9MFPYEKZ complete

On the compute node, systemd logged

Nov 01 08:41:05 picl1 systemd[796]: Starting imp-shell-1-f3L9MFPYEKZ.service - User workload...
Nov 01 08:41:05 picl1 systemd[796]: Started imp-shell-1-f3L9MFPYEKZ.service - User workload.
Nov 01 08:42:32 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Sent signal SIGTERM to main process 907 (flux-imp) on client request.
Nov 01 08:42:57 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Sent signal SIGUSR1 to main process 907 (flux-imp) on client request.
Nov 01 08:42:57 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Main process exited, code=killed, status=9/KILL
Nov 01 08:42:57 picl1 systemd[796]: imp-shell-1-f3L9MFPYEKZ.service: Failed with result 'signal'.

@grondo
Copy link
Contributor Author

grondo commented Nov 1, 2024

This morning I added one fix here: cgroup_kill() was always sending SIGKILL instead of the sig parameter. Fixed.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@grondo
Copy link
Contributor Author

grondo commented Nov 1, 2024

Great, thanks! I'll set MWP. I'll post a PR to have the IMP wait until its cgroup is empty in a bit.

@mergify mergify bot merged commit 1d7cf73 into flux-framework:master Nov 1, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants